What affects road safety in the UK?

Import notebook instructions

1. Introduction

Apart from disease, traffic accidents are one of the most significant contributors to mortality. Some facts from brake.org:

More than 1.35 million people die on the world's roads, and millions more are seriously injured every year.

Road deaths are the eighth highest cause of death for people of all ages.

Road deaths are the number one killer of those between the ages of 5-29.

A common stereotype holds that female drivers are more prone to traffic accidents because of their 'poor' driving skills. But is this the case? Are young people more likely to be involved in traffic accidents than middle-aged people? When are traffic accidents more likely to happen, at night or during the day? In which cities or clusters in the UK are traffic accidents most likely to occur? And how do weather, road, and junction conditions matter?

In this project, we will answer all of these questions. We hope to find the characteristics, spatial and temporal distribution, and future trends of traffic accidents by exploring the Road Safety database with data analysis and visualization.

The advent of Covid-19 was a huge disaster for humanity: many people fell ill or lost their lives. It also changed travel habits, reducing the number of trips and increasing the time spent at home. Therefore, we will further analyze the relationship between the epidemic and road safety.

2. Dataset

2.1 Introduction and Description of Dataset

In this project, we need two kinds of data: one relating to road safety, the other to Covid-19.

Road Safety Data

The Road Safety dataset is published by the Department for Transport (https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data) under the Open Government Licence; it was last updated on 25 November 2021.

The dataset provides detailed road safety data about the circumstances of personal injury road accidents in Great Britain from 2016, the types of vehicles involved, and the consequential casualties, in three sub-datasets: Accident, Casualty, and Vehicle. The statistics relate only to personal injury accidents on public roads that were reported to the police and subsequently recorded using the STATS19 accident reporting form.

Covid Data

The Covid-19 dataset analyzed in this project comes from https://coronavirus.data.gov.uk, published by the UK government and last updated on 20 January 2022.

2.2 Data Acquisition

Road Safety

We mainly use the urllib library to manipulate web page URLs and to read the content of web pages, then save the data we read in csv format.
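The download step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: `download_csv` is a hypothetical helper name, and the URL in the usage comment is a placeholder, not the real location of the data files.

```python
from pathlib import Path
from urllib.request import urlopen


def download_csv(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Read the content behind `url` and save it unchanged as a CSV file."""
    dest_path = Path(dest)
    with urlopen(url, timeout=timeout) as response:
        # The response body is raw bytes; write it to disk as-is.
        dest_path.write_bytes(response.read())
    return dest_path


# Hypothetical usage (placeholder URL, not the real data location):
# download_csv("https://example.org/road-safety/accidents.csv", "accidents.csv")
```

Saving the raw bytes first and parsing later keeps the download step independent of any cleaning decisions.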

Covid Data

We used the API provided by the gov.uk website to access its Covid-19 dataset, retrieved the daily Covid-19 figures, and selected what we needed: new cases and deaths.

2.3 Data cleaning

2.3.1 Initial Data Cleaning

Road Safety

To simplify the analysis of the data that follows and to improve the efficiency of the analysis, we do the data cleaning in the following steps:

  1. Drop unwanted columns from the three road safety datasets.
  2. Merge the three datasets and drop duplicate columns.
  3. Check the merged dataset for NA values.
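The three steps above can be sketched with pandas as follows. The tiny tables, the column names, and the choice of `accident_index` as the merge key are assumptions made for illustration; the real datasets have many more columns.

```python
import pandas as pd

# Tiny stand-in tables; column names and values are illustrative assumptions.
accidents = pd.DataFrame({
    "accident_index": ["A1", "A2"],
    "accident_year": [2019, 2020],
    "speed_limit": [30, 60],
    "police_force": [1, 2],          # treated as an unwanted column here
})
vehicles = pd.DataFrame({
    "accident_index": ["A1", "A2"],
    "accident_year": [2019, 2020],   # duplicates the accident table's column
    "sex_of_driver": [1, 2],
})
casualties = pd.DataFrame({
    "accident_index": ["A1", "A2"],
    "accident_year": [2019, 2020],   # duplicates the accident table's column
    "age_of_casualty": [25, None],
})

# 1. Drop unwanted columns.
accidents = accidents.drop(columns=["police_force"])

# 2. Merge on the shared key, removing columns that would be duplicated.
merged = (
    accidents
    .merge(vehicles.drop(columns=["accident_year"]), on="accident_index", how="inner")
    .merge(casualties.drop(columns=["accident_year"]), on="accident_index", how="inner")
)

# 3. Count missing values per column.
na_counts = merged.isna().sum()
print(na_counts)
```

Dropping the duplicated columns before merging avoids the `_x`/`_y` suffixes that pandas adds when both sides share a non-key column.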

Different data visualisations require differently shaped data frames. Only the initial data cleaning is done in this section; further ad hoc grouping and other cleaning steps are carried out throughout the notebook.

Covid Data

2.3.2 Reshape Data

Road Safety

Many values in the road safety dataset are numeric codes; for example, 'sex_of_driver = 1' means male and 'weather_conditions = 2' means raining without high winds. To show more information directly in the visualizations that follow, we replaced some of these codes with text labels, taking reference from the 'Guide Index' provided by the UK Department for Transport.

We first use a for loop to cast every column we need to string type, which makes it easier to replace the numeric codes with text. Secondly, as dates and times are special variables, we convert them into datetime objects. Finally, to facilitate the subsequent analysis, we reclassified the time variable, dividing the 24 hours of the day into five phases: morning rush, office hours, afternoon rush, evening, and night.
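A minimal sketch of the time reclassification is shown below. The text names the five phases but not their boundaries, so the hour cut-offs here are assumptions, as are the helper names `time_phase` and `phase_from_time` and the "HH:MM" input format.

```python
from datetime import datetime


def time_phase(hour: int) -> str:
    """Map an hour of the day (0-23) to one of five phases.

    The cut-off hours are illustrative assumptions, not taken from the source.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if 5 <= hour < 10:
        return "morning rush"
    if 10 <= hour < 15:
        return "office hours"
    if 15 <= hour < 19:
        return "afternoon rush"
    if 19 <= hour < 23:
        return "evening"
    return "night"  # 23:00 to 04:59


def phase_from_time(time_str: str) -> str:
    """Parse a time string such as '17:30' (assumed format) and return its phase."""
    return time_phase(datetime.strptime(time_str, "%H:%M").hour)


print(phase_from_time("17:30"))
```

Keeping the boundaries in one small function makes it easy to adjust them and re-run the grouping.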

ARIMA Prediction

3. Data analysis and visualisation

3.1 When do traffic accidents usually happen?

To identify when traffic accidents peaked over the period 2016-2020, we first analyzed how the number of accidents changed over time, both monthly and quarterly. We then analyzed which days of the week and which periods of the day had the most traffic accidents.

Accidents per month

The end of each year is a high accident period, and the total number of traffic accidents decreases each year.

As the graph shows, traffic accidents peak towards the end of each year and then fall from the end of the year into the beginning of the following year. This suggests that the rise and fall is likely related to the end-of-year holidays. At the same time, the overall number of accidents is decreasing year on year, which indicates that road safety is gradually improving.

Accidents by Times and Weekdays

The afternoon rush and office hours are the most accident-prone times of the day, and the Friday afternoon rush is often the most accident-prone time of the week.

The heat map shows significantly fewer crashes during the afternoon rush on weekends than on weekdays, but substantially more crashes on weekend evenings than on weekday evenings. A likely explanation is that more people go out and enjoy nightlife at weekends, so there is more traffic, and more accidents, late in the day.

3.2 Who is more likely to be involved in a traffic accident?

In this section, we profile the drivers and casualties involved in traffic accidents.

Males are more likely to be involved in traffic accidents than females. Male drivers are about twice as likely as female drivers to be involved in a traffic accident, and about 59% of the casualties were men.

In 2020, the population of the United Kingdom was over 67 million, with 33.94 million females and 33.15 million males, so the two sexes are well balanced in number. According to Statista, 81% of men hold a driving licence compared with 70% of women, a gap of 11 percentage points. From these figures we can estimate the share of licence holders involved in a crash over the past five years: about 1.5% for female drivers (352,527 / [33.94 million × 70%]) compared with about 2.9% for male drivers (788,352 / [33.15 million × 81%]).
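The arithmetic behind these two rates can be reproduced with a few lines of Python. The figures are those quoted in the text; `involvement_rate` is a hypothetical helper name, and applying the licence share to the whole population of each sex is a simplification carried over from the text.

```python
def involvement_rate(drivers_involved: int, population: float, licence_share: float) -> float:
    """Drivers involved in accidents divided by the estimated number of licence holders."""
    licence_holders = population * licence_share
    return drivers_involved / licence_holders


# Figures quoted in the text (five-year totals, 2020 population, Statista licence shares).
female_rate = involvement_rate(352_527, 33.94e6, 0.70)
male_rate = involvement_rate(788_352, 33.15e6, 0.81)

print(f"female: {female_rate:.2%}")
print(f"male:   {male_rate:.2%}")
print(f"male-to-female ratio: {male_rate / female_rate:.2f}")
```

The ratio of the two rates comes out close to two, which is the basis for the "twice as likely" statement.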

As for casualties, there were 726,276 male casualties (59.12%) and 502,263 female casualties (40.88%).

Thirty-year-olds are most likely to be involved in traffic accidents and become victims. Young people who have just received their driver's licence need to stay alert.

Ages 16-18 and age 30 are the two turning points: the likelihood of a traffic accident rises sharply over the former and starts to fall after the latter.

The number of drivers and casualties involved in traffic accidents increases sharply between the ages of 16 and 18: casualties rise from 13.39k to 28.64k, an increase of 114%, and drivers involved rise from 3.98k to 22.3k, an increase of 460% (about 5.6 times as many). In the UK, people can ride a moped from age 16 and drive a car from age 17, so the pool of licensed drivers grows quickly over these years, which in turn raises the number of 16-18 year old drivers involved in traffic accidents. In addition, 18-year-olds in the UK can legally buy alcohol, and drink-driving both causes traffic accidents and exacerbates their severity.

3.3 Under what circumstances are traffic accidents more likely to occur?

3.3.1 Weather Condition: accidents on foggy days are more severe

Serious and fatal accidents are more likely to occur on foggy or misty days and on days with high winds. Serious traffic accidents are also more likely on fine days than on rainy or snowy days.

Here, we divide weather conditions into two categories: meteorological conditions and wind. As expected, among the meteorological conditions, serious or fatal accidents are most likely in fog or mist. Contrary to our expectations, however, fine days see more accidents than rainy or snowy days, and those accidents are more often severe: the share of severe accidents is 20%, 19%, and 16% for fine, rainy, and snowy days respectively. One possible explanation is that drivers tend to be less careful on fine days and are more easily distracted by the scenery, whereas on rainy and snowy days they drive more slowly and cautiously.

3.3.2 Road Condition:

20, 30, 40, 50, 60, and 70 mph are the only valid speed limits on public highways.

Dual carriageways are more prone to fatal traffic accidents. Night-time lighting reduces the severity of traffic accidents. Traffic accidents tend to be more severe on roads with higher speed limits.

Among the different road types, single carriageways have more traffic accidents than dual carriageways. However, of the 91,726 traffic accidents on dual carriageways, 1.73% were fatal, the highest share of any road type; at roundabouts, by comparison, the probability of a severe traffic accident is 0.36%. A plausible explanation is that traffic on dual carriageways typically moves faster, so collisions involve greater impact forces. Single carriageways have the lowest share of slight accidents (79.3%), while slip roads have the highest (90.2%). This is likely because speeds tend to be higher on single carriageways and lower on slip roads. Reducing driving speed reduces the force of a collision and thus its severity.

Of the 20,744 accidents in unlit darkness, 4.75% were severe, compared with 1.1% of the 428,051 accidents in daylight. 72% of accidents occurred in daylight, because more people and vehicles travel during the day; 20% of accidents occurred in the unlit environment. This suggests that better street lighting at night could reduce the severity of traffic accidents.

60% of traffic accidents occur on roads with a speed limit of 30 mph, and 13% occur on roads with a speed limit of 60 mph. Roads with a 30 mph limit are typically in built-up areas and often include sharp bends, damaged surfaces, and humpback bridges, where drivers and pedestrians have little time to react, so accidents are more likely. When accidents occur on roads with higher speed limits, they tend to be more serious: the proportion of serious accidents is 3.6% on 60 mph roads but only 0.48% on 20 mph roads.

3.3.3 Junction Conditions:

58% of accidents occur at intersections, about half of them at T or staggered junctions; stop signs and authorised persons can help reduce the number of accidents.

58% of accidents occur at intersections, while 42% occur away from junctions (not at or within 20 metres of one). Intersections are the hubs of road traffic: flows from different directions create more conflict and weaving points, and drivers are often distracted because they are thinking about their route. As a result, intersections tend to be accident hotspots. Of all accidents, 30% occur at T or staggered junctions, 9.5% at crossroads, and 8.2% at roundabouts, so extra care is needed when approaching these types of intersection. On slip roads, drivers going downhill sometimes switch off the engine and coast to save fuel, which leaves too little time to react in an emergency.

45% of accidents occur at give way or uncontrolled junctions, 11% at automatic traffic signals, 0.6% at junctions with a stop sign, and only 0.3% at junctions controlled by an authorised person. These shares also depend on how common each type of control is, but they suggest that, to some extent, installing a stop sign at a junction can help prevent traffic accidents.

3.4 Spatial Analysis: In which cities or regions do traffic accidents occur more frequently?

3.4.1 Cities: London and Birmingham have the highest number of traffic accidents

Zooming in on the interactive map reveals a black dot marking the location with the highest number of traffic accidents over the last five years combined: 12.55k. This is Birmingham, the second-largest city in the UK after London. Because accidents are counted by local authority district, London's total is split across its boroughs and does not appear as a single figure on the map. Even so, the borough data alone show that London has the highest number of accidents in the UK. Leeds, Cornwall, and Wiltshire also have high numbers of traffic accidents.

3.4.2 Cluster analysis: Locating accident hotspots in central London

This section uses clustering (DBSCAN) to identify accident hotspots in central London in 2020. DBSCAN groups points that are densely packed together and marks points outside these groups as noise, so locations with a high accident density are highlighted as clusters. We then use folium to plot the locations of these clusters.

  1. First, we filter the data we need.

  2. DBSCAN needs a distance metric to measure how far apart two points are, so we create a function that takes the latitude and longitude of two points and returns the distance (in metres) between them.

  3. To ensure that the earth's curvature is taken into account, this function calculates the distance with geopy's great_circle function.
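To make the distance step concrete, here is a standard-library sketch of a great-circle distance using the haversine formula. It is a stand-in for illustration only: the notebook itself relies on geopy's great_circle, and the function name `great_circle_m`, the earth-radius constant, and the sample coordinates are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean earth radius in metres (assumed constant)


def great_circle_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # Haversine formula: accounts for the earth's curvature on a sphere.
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Two illustrative points in central London (coordinates are approximate).
print(round(great_circle_m(51.5178, -0.0823, 51.5101, -0.0860)))
```

A function of this shape, taking two points and returning metres, is what a density-based clustering step needs in order to decide whether two accidents are close enough to belong to the same hotspot.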

The traffic accident clusters are most pronounced near Liverpool Street Station and Monument Station in central London.

In the City of London, there are five accident hotspot clusters, each shown in a different colour.

From left to right in the above figure, the light blue cluster covers the junction of Charterhouse St., West Smithfield, and Farringdon St., near Holborn and LSE. The light green cluster is at the intersection of London Wall and Moorgate. The pink cluster is at the crossroads of Cannon St., King William St., and Gracechurch St. The grass green cluster is around Bishopsgate, Wormwood St., and Camomile St., near Liverpool Street Station, where the clustering is most evident. The navy cluster lies mainly on Aldgate High St.

All clusters are located in popular, crowded areas, especially near T-junctions, crossroads, and Underground stations.

3.5 ARIMA: Covid accelerates reduction in the number of traffic accidents

Visualize Covid Data

In March 2020, the Covid-19 epidemic began to spread in the UK.

From March 2020, the number of new Covid-19 cases and deaths in the UK started to rise, and some people began to work from home and reduce their daily travel.

Tip: Click the new confirmed cases entry in the legend to hide that curve, which makes the other three series easier to see.

Monthly Trend

This part aims to predict the number of future road accidents in the UK using a time series forecasting method, ARIMA. ARIMA stands for Auto-Regressive Integrated Moving Average, and there are seasonal and non-seasonal ARIMA models. Three terms characterize an ARIMA model: p, d, and q, where p is the order of the AR term, q is the order of the MA term, and d is the number of differences required to make the time series stationary. If a time series has seasonal patterns, seasonal terms are added and the model becomes SARIMA, short for ‘Seasonal ARIMA’, which is the variant fitted in Step 2.

Step 1: Extract the average number of accidents in each month. We visualise the data using time-series decomposition, which splits the series into three distinct components: trend, seasonality, and noise.

Step 2: Fitting the ARIMA model. This step is parameter selection for our seasonal ARIMA model: we use a “grid search” over combinations of parameters to find the set that yields the best performance.
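The candidate parameter combinations for such a grid search can be generated as in the sketch below. The search range (0 or 1 for each term) and the seasonal period of 12 months are assumptions for illustration; the text does not state the ranges that were used.

```python
from itertools import product

# Candidate values for each order term; the 0..1 range is an assumed search space.
p = d = q = range(0, 2)

# Non-seasonal (p, d, q) combinations.
pdq = list(product(p, d, q))

# Seasonal (P, D, Q, s) combinations, assuming a 12-month season for monthly data.
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in product(p, d, q)]

# Every pairing of a non-seasonal order with a seasonal order.
grid = list(product(pdq, seasonal_pdq))

print(f"{len(grid)} candidate models, for example:")
for order, seasonal_order in grid[:3]:
    print(f"  SARIMA{order} x {seasonal_order}")

# In a full grid search, each pair would be fitted to the monthly series and
# scored (for example by AIC, which is an assumption about the criterion),
# keeping the best-scoring pair.
```

Enumerating the grid separately from the fitting loop makes it easy to see how quickly the number of candidate models grows as the ranges widen.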

Step 3: Evaluation of forecasts. To understand the accuracy of our forecasts, we compare the predicted number of accidents with the real number of accidents in the time series, with forecasts starting at 2017-01-01 and running to the end of the data. We measure accuracy with the mean squared error (MSE), a widely used performance metric defined as the average of the squared differences between the observed and predicted values.
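As a small worked illustration of the metric, the MSE can be computed directly from its definition. The numbers below are made up for the example and are not results from the model.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one value is required")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up monthly accident counts, for illustration only.
observed = [100, 120, 90]
predicted = [110, 115, 90]
print(mean_squared_error(observed, predicted))
```

Because the errors are squared, a few large misses weigh much more heavily than many small ones, which is worth keeping in mind when reading the score.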

The orange and blue lines overlap well with each other, representing a good model fit.

Step 4: Visualising forecasts. We set the number of steps to 21, which means the model predicts the next 21 months. As the graph below shows, the number of road accidents in the UK is forecast to keep declining until the end of 2021.

After the outbreak of Covid-19, the decline in traffic accidents accelerated.

Here, we use traffic accident data up to March 2020 (before the outbreak of Covid-19) as the sample, fit an ARIMA model to forecast the following months, and compare the forecasts with the actual values.

In the graph, every monthly forecast (blue dashed line) is higher than the true value (solid red line), which shows that the decline in traffic accidents accelerated after the outbreak. The gap is especially large for April 2020, February 2021, and October-November 2021. During these periods, either the government enforced strict quarantine policies or the Omicron variant emerged and spread widely, so people travelled less and there were fewer traffic accidents. The next graph shows the relationship between Covid-19 and traffic accident data more clearly.

Daily Trend

New deaths and actual accidents vary inversely.

From April 2020 to August 2021, we can see four cycles in which new deaths and actual daily accidents move in opposite directions: when new deaths decrease, daily accidents increase, and vice versa. The two curves always bend in opposite directions.

3.6 Random Forest: Speed limit has the greatest impact on traffic accidents

Here we take the variables used above, such as driver age, junction detail, weather conditions, time of travel, and speed limit, and use them to predict the severity of traffic accidents and to assess which factors affect it most.

Step 1: As we are using a tree-based model rather than a distance-based one, features with different ranges can be handled directly, so scaling is not required. We use the train_test_split function to split the sample randomly into training and test data, holding out 20% for testing.

Step 2: In classification, the random forest algorithm offers high accuracy, low generalisation error, and the ability to handle high-dimensional data; in training, it learns quickly and is easy to parallelise. However, when the class distribution is imbalanced, i.e. one class has far fewer samples than the others, random forests can suffer from poor classification results and large generalisation errors. So we need to check the class distribution of the data.

Step 3: As we can see, the target classes are highly imbalanced. The best strategy, collecting more data, especially for the minority class, is not available to us. Instead, we can use evaluation metrics that are better suited to imbalanced classes than accuracy, such as the confusion matrix, precision, or the ROC curve. Alternatively, we can use the class weight parameter included in some implementations of the model, which lets the algorithm adjust for imbalanced classes. Here we focus on the class weight parameter.
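To illustrate what class weighting does, the sketch below computes weights that are inversely proportional to class frequency, using the heuristic n_samples / (n_classes * class_count) that scikit-learn documents for its "balanced" class weight option. The label counts are made up, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up, imbalanced severity labels for illustration only.
labels = ["slight"] * 8 + ["serious"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives the larger weight, so misclassifying one of its samples costs the model more during training.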

Step 4: We found that the random forest using the class_weight parameter did not perform very well in classifying severity levels. Therefore, we try a resampling strategy to handle the imbalanced target classes: the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbours, until the minority class is equal in size to the majority class.

In the final result, the speed limit has the highest relative feature importance, with office hours second. In real life, excessive speed is often a direct cause of traffic accidents. Among the weather indicators, fine and rainy days have comparatively high feature importance; among the times of day, office hours and the evening are the most important. For policymakers, reducing the severity of traffic accidents can therefore start with limiting speeds and managing traffic flow during peak periods.

4. Conclusion

In this project, we first introduced the background and datasets of the study. We downloaded the road safety and Covid-19 data via URL and API respectively, and loaded the UK-wide base map through the Local Authority Districts API. We draw the following conclusions.

  1. Temporal distribution: In 2016-2020, the total number of accidents is highest at the end of each year, with most traffic accidents occurring during office hours and the afternoon rush on weekdays.

  2. Sex: Contrary to the stereotype, males (2.94%) are more likely to be involved in traffic accidents than females (1.48%), so male drivers are about twice as likely as female drivers to be involved in a traffic accident. About 59% of the casualties were men.

  3. Age: Thirty-year-olds are most likely to be involved in traffic accidents and become victims; beyond this age, the number of accidents gradually decreases. Ages 16-18 are another turning point, with a sharp increase in the likelihood of a traffic accident as people reach the legal ages for driving and drinking. We therefore suggest that young people who have just received their driver's licence stay especially alert.

  4. Weather Condition: Serious and fatal accidents are more likely to occur in fog, mist, and high winds. Serious traffic accidents are also more likely on fine days than on rainy or snowy days, so drivers need to stay careful on fine days. As the old saying goes: sing before breakfast, you'll cry before supper.

  5. Junction condition: 58% of accidents occur at intersections, about half of them at T or staggered junctions; stop signs and authorised persons can help reduce the number of accidents. From the driver's perspective, it is important to stay highly focused when approaching an intersection. From the policy maker's perspective, stop signs and signals should be installed at intersections where accidents are likely to occur and, where possible, authorised persons should be deployed to manage traffic at the junction.

  6. Spatial Analysis: The two largest cities in the UK, London and Birmingham, have the highest numbers of traffic accidents. In central London, the accident clusters are most pronounced near Liverpool Street Station and Monument Station.

  7. Covid and road safety: The Covid-19 epidemic spread in the UK from March 2020. After the outbreak, the decline in traffic accidents accelerated, and new deaths and actual accidents vary inversely.

  8. Random Forest: Among all the variables in the road safety dataset, the speed limit has the highest relative feature importance.